Abstract

We present an algorithmic framework for 
learning multiple related tasks. Our frame- 
work exploits a form of prior knowledge that 
relates the output spaces of these tasks. We 
present PAC learning results that analyze the 
conditions under which such learning is pos- 
sible. We present results on learning a shal- 
low parser and named-entity recognition sys- 
tem that exploits our framework, showing con- 
sistent improvements over baseline methods. 

1 Introduction 

When two NLP systems are run on the same data, we 
expect certain constraints to hold between their out- 
puts. This is a form of prior knowledge. We propose 
a self-training framework that uses such information 
to significantly boost the performance of one of the 
systems. The key idea is to perform self-training 
only on outputs that obey the constraints. 

Our motivating example in this paper is the task 
pair: named entity recognition (NER) and shallow 
parsing (aka syntactic chunking). Consider a hid- 
den sentence with known POS and syntactic struc- 
ture below. Further consider four potential NER se- 
quences for this sentence. 



POS    NNP NNP VBD TO NNP NN
Chunk  [- NP -][-VP-][-PP-][-NP-][-NP-]
NER1   [- Per -][- -][-Org-][- -]
NER2   [- Per -][- -][- -][- -][- -]
NER3   [- Per -][- -][- -][- Org -]
NER4   [- Per -][- -][- -][-Org-][- -]



Without ever seeing the actual sentence, can we
guess which NER sequence is correct? NER1 seems
wrong because we feel like named entities should
not be part of verb phrases. NER2 seems wrong
because there is an NNP¹ (proper noun) that is not part
of a named entity (word 5). NER3 is amiss because
we feel it is unlikely that a single name should span
more than one NP (last two words). NER4 has none
of these problems and seems quite reasonable. In
fact, for the hidden sentence, NER4 is correct.²

The remainder of this paper deals with the prob- 
lem of formulating such prior knowledge into a 
workable system. There are similarities between 
our proposed model and both self-training and
co-training; background is given in Section 2. We
present a formal model for our approach and
perform a simple, yet informative, analysis (Section 3).
This analysis allows us to define what good and 
bad constraints are. Throughout, we use a running 
example of NER using hidden Markov models to 
show the efficacy of the method and the relation- 
ship between the theory and the implementation. Fi- 
nally, we present full-blown results on seven dif- 
ferent NER data sets (one from CoNLL, six from 
ACE), comparing our method to several competitive
baselines (Section 4). We see that for many of
these data sets, fewer than one hundred labeled NER
sentences are required to get state-of-the-art
performance, using a discriminative sequence labeling
algorithm (Daumé III and Marcu, 2005).

2 Background 

Self-training works by learning a model on a small
amount of labeled data. This model is then evaluated
on a large amount of unlabeled data. Its predictions
are assumed to be correct, and it is retrained
on the unlabeled data according to its own predictions.
Although there is little theoretical support
for self-training, it is relatively popular in the natural
language processing community. Its success stories
range from parsing (McClosky et al., 2006) to
machine translation (Ueffing, 2006). In some cases,
self-training takes into account model confidence.

1 When we refer to NNP, we also include NNPS.

2 The sentence is: "George Bush spoke to Congress today"



Co-training (Yarowsky, 1995; Blum and Mitchell,
1998) is related to self-training, in that an algorithm
is trained on its own predictions. Where it differs is
that co-training learns two separate models (which
are typically assumed to be independent; for instance
by training with disjoint feature sets). These
models are both applied to a large repository of
unlabeled data. Examples on which these two models
agree are extracted and treated as labeled for a
new round of training. In practice, one often also
uses a notion of model confidence and only extracts
agreed-upon examples for which both models are
confident. The original, and simplest, analysis of
co-training is due to Blum and Mitchell (1998). It does
not take into account confidence (to do so requires a
significantly more detailed analysis (Dasgupta et al.,
2001)), but is useful for understanding the process.



3 Model 



We define a formal PAC-style (Valiant, 1994) model
that we call the "hints model".³ We have an instance
space $\mathcal{X}$ and two output spaces $\mathcal{Y}_1$ and $\mathcal{Y}_2$. We
assume two concept classes $\mathcal{C}_1$ and $\mathcal{C}_2$ for each output
space respectively. Let $\mathcal{D}$ be a distribution over $\mathcal{X}$,
and $f_1 \in \mathcal{C}_1$ (resp., $f_2 \in \mathcal{C}_2$) be target functions. The
goal, of course, is to use a finite sample of examples
drawn from $\mathcal{D}$ (and labeled, perhaps with noise,
by $f_1$ and $f_2$) to "learn" $h_1 \in \mathcal{C}_1$ and $h_2 \in \mathcal{C}_2$, which
are good approximations to $f_1$ and $f_2$.

So far we have not made use of any notion of
constraints. Our expectation is that if we constrain $h_1$
and $h_2$ to agree (vis-a-vis the example in the
Introduction), then we should need fewer labeled
examples to learn either. (The agreement should "shrink"
the size of the corresponding hypothesis spaces.) To
formalize this, let $\chi : \mathcal{Y}_1 \times \mathcal{Y}_2 \to \{0, 1\}$ be a
constraint function. We say that two outputs $y_1 \in \mathcal{Y}_1$
and $y_2 \in \mathcal{Y}_2$ are compatible if $\chi(y_1, y_2) = 1$. We
need to assume that $\chi$ is correct:
Definition 1. We say that $\chi$ is correct with respect
to $\mathcal{D}, f_1, f_2$ if whenever $x$ has non-zero probability
under $\mathcal{D}$, then $\chi(f_1(x), f_2(x)) = 1$.



Running Example 

In our example, $\mathcal{Y}_1$ is the space of all POS/chunk
sequences and $\mathcal{Y}_2$ is the space of all NER
sequences. We assume that $\mathcal{C}_1$ and $\mathcal{C}_2$ are both
represented by HMMs over the appropriate state
spaces. The functions we are trying to learn are $f_1$,
the "true" POS/chunk labeler and $f_2$, the "true"
NER labeler. (Note that we assume $f_1 \in \mathcal{C}_1$,
which is obviously not true for language.)

Our constraint function $\chi$ will require the following
for agreement: (1) any NNP must be part of a
named entity; (2) any named entity must be a
subsequence of a noun phrase. This is precisely the set
of constraints discussed in the introduction.
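
To make the two constraints concrete, here is a minimal Python sketch of such a compatibility check. The BIO tag encoding (`B-NP`, `I-PER`, and so on), the helper names `spans` and `chi`, and the reading of "subsequence of a noun phrase" as "contained in a single NP chunk" are all assumptions made for this illustration; it is not the finite-state implementation described in Section 4.2.

```python
def spans(tags):
    """Return (start, end, label) spans from BIO tags; end is exclusive."""
    out, start, label = [], None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last span
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != label
        )
        if start is not None and (starts_new or not tag.startswith("I-")):
            out.append((start, i, label))
            start, label = None, None
        if starts_new:
            start, label = i, tag[2:]
    return out


def chi(y1, y2):
    """y1: list of (pos, chunk) pairs; y2: list of NER tags. Returns 1 or 0."""
    # (1) every proper noun must be inside some named entity
    if any(pos in ("NNP", "NNPS") and ner == "O" for (pos, _), ner in zip(y1, y2)):
        return 0
    # (2) every named entity must lie inside a single NP chunk
    nps = [(s, e) for s, e, lab in spans([c for _, c in y1]) if lab == "NP"]
    for s, e, _ in spans(y2):
        if not any(ns <= s and e <= ne for ns, ne in nps):
            return 0
    return 1


# POS/chunk structure of the hidden sentence from the Introduction
y1 = [("NNP", "B-NP"), ("NNP", "I-NP"), ("VBD", "B-VP"),
      ("TO", "B-PP"), ("NNP", "B-NP"), ("NN", "B-NP")]
ner2 = ["B-PER", "I-PER", "O", "O", "O", "O"]          # NNP left outside an entity
ner3 = ["B-PER", "I-PER", "O", "O", "B-ORG", "I-ORG"]  # entity spans two NPs
ner4 = ["B-PER", "I-PER", "O", "O", "B-ORG", "O"]
print(chi(y1, ner2), chi(y1, ner3), chi(y1, ner4))  # 0 0 1
```

Under these assumptions, NER2 and NER3 from the Introduction are rejected and NER4 is accepted.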



The question is: given this additional source of
knowledge (i.e., $\chi$), has the learning problem
become easier? That is, can we learn $f_2$ (and/or $f_1$)
using significantly fewer labeled examples than if we
did not have $\chi$? Moreover, we have assumed that $\chi$
is correct, but is this enough? Intuitively, no: a
function $\chi$ that returns 1 regardless of its inputs is clearly
not useful. Given this, what other constraints must
be placed on $\chi$? We address these questions in
Section 3.3. However, first we define our algorithm.



3.1 One-sided Learning with Hints 

We begin by considering a simplified version of the
"learning with hints" problem. Suppose that all we
care about is learning $f_2$. We have a small amount of
data labeled by $f_2$ (call this $D$) and a large amount of
data labeled by $f_1$ (call this $D^{\text{unlab}}$; "unlab" because
as far as $f_2$ is concerned, it is unlabeled).



3 The name comes from thinking of our knowledge-based 
constraints as "hints" to a learner as to what it should do. 



Running Example 

In our example, this means that we have a small
amount of labeled NER data and a large amount of
labeled POS/chunk data. We use 3500 sentences
from CoNLL (Tjong Kim Sang and De Meulder,
2003) as the NER data and sections 20-23 of the
WSJ (Marcus et al., 1993; Ramshaw and Marcus,
1995) as the POS/chunk data (8936 sentences). We
are only interested in learning to do NER. Details
of the exact HMM setup are in Section 4.2.



We call the following algorithm "One-Sided
Learning with Hints," since it aims only to learn $f_2$:
1: Learn $h_2$ directly on $D$
2: For each example $(x, y_1) \in D^{\text{unlab}}$
   (a) Compute $y_2 = h_2(x)$
   (b) If $\chi(y_1, y_2)$, add $(x, y_2)$ to $D$
3: Relearn $h_2$ on the (augmented) $D$
4: Go to (2) if desired
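
The procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `train` is a hypothetical learner that maps a list of `(x, y2)` pairs to a callable hypothesis, `chi` is any compatibility function, and the function name `one_sided_hints` is ours. One deliberate deviation is that each round rebuilds the augmented set from the original labeled data, so that repeating the loop does not accumulate duplicate examples.

```python
def one_sided_hints(train, labeled, hint_labeled, chi, rounds=1):
    """Self-train h2, keeping only predictions compatible with the hints.

    train:        function from a list of (x, y2) pairs to a hypothesis h2(x)
    labeled:      small list of (x, y2) pairs (the set D)
    hint_labeled: large list of (x, y1) pairs (labeled for the other task)
    chi:          compatibility function chi(y1, y2) -> truthy / falsy
    """
    h2 = train(list(labeled))               # learn h2 directly on D
    for _ in range(rounds):                 # "go to (2) if desired"
        augmented = list(labeled)
        for x, y1 in hint_labeled:
            y2 = h2(x)                      # predict with the current model
            if chi(y1, y2):                 # keep only compatible outputs
                augmented.append((x, y2))
        h2 = train(augmented)               # relearn on the augmented data
    return h2
```

Passing a `chi` that always returns 1 reduces this sketch to plain self-training.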

Running Example 

In step 1, we train an NER HMM on CoNLL. On 
test data, this model achieves an F-score of 50.8. 
In step 2, we run this HMM on all the WSJ data, 
and extract 3145 compatible examples. In step 3,
we retrain the HMM; the F-score rises to 58.9. 



3.2 Two-sided Learning with Hints 

In the two-sided version, we assume that we have a
small amount of data labeled by $f_1$ (call this $D_1$), a
small amount of data labeled by $f_2$ (call this $D_2$) and
a large amount of unlabeled data (call this $D^{\text{unlab}}$).
The algorithm we propose for learning hypotheses
for both tasks is below:

1: Learn $h_1$ on $D_1$ and $h_2$ on $D_2$.

2: For each example $x \in D^{\text{unlab}}$:

   (a) Compute $y_1 = h_1(x)$ and $y_2 = h_2(x)$

   (b) If $\chi(y_1, y_2)$, add $(x, y_1)$ to $D_1$ and $(x, y_2)$ to $D_2$

3: Relearn $h_1$ on $D_1$ and $h_2$ on $D_2$.

4: Go to (2) if desired
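
A matching sketch of the two-sided procedure follows. As before, `train1` and `train2` are hypothetical learners returning callable hypotheses and the function name is ours; the only structural difference from the one-sided case is that both labels come from models, and an unlabeled example is added to both training sets only when the two predictions are compatible.

```python
def two_sided_hints(train1, train2, labeled1, labeled2, unlabeled, chi, rounds=1):
    """Jointly self-train h1 and h2 on unlabeled inputs whose predictions agree.

    train1, train2: functions from a list of (x, y) pairs to a hypothesis h(x)
    labeled1/2:     small labeled sets for task 1 and task 2 (D1 and D2)
    unlabeled:      list of inputs x with no labels for either task
    """
    h1, h2 = train1(list(labeled1)), train2(list(labeled2))
    for _ in range(rounds):
        aug1, aug2 = list(labeled1), list(labeled2)
        for x in unlabeled:
            y1, y2 = h1(x), h2(x)
            if chi(y1, y2):                 # both predictions satisfy the hints
                aug1.append((x, y1))
                aug2.append((x, y2))
        h1, h2 = train1(aug1), train2(aug2)
    return h1, h2
```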



3.3 Analysis 

Our goal is to prove that one-sided learning with
hints "works." That is, if $\mathcal{C}_2$ is learnable from
large amounts of labeled data, then it is also
learnable from small amounts of labeled data and large
amounts of $f_1$-labeled data. This is formalized in
Theorem 1 (all proofs are in Appendix A). However,
before stating the theorem, we must define an
"initial weakly-useful predictor" (terminology from
Blum and Mitchell (1998)), and the notion of noisy
PAC-learning in the structured domain.
Definition 2. We say that $h$ is a weakly-useful
predictor of $f$ if for all $y$: $\Pr_{\mathcal{D}}[h(x) = y] \geq \epsilon$
and $\Pr_{\mathcal{D}}[f(x) = y \mid h(x) = y' \neq y] \geq \Pr_{\mathcal{D}}[f(x) = y] + \epsilon$.

This definition simply ensures that (1) h is non- 
trivial: it assigns some non-zero probability to every 
possible output; and (2) h is somewhat indicative of 
$f$. In practice, we use the hypothesis learned on the
small amount of training data during step (1) of the 
algorithm as the weakly useful predictor. 
Definition 3. We say that $\mathcal{C}$ is PAC-learnable with
noise (in the structured setting) if there exists an
algorithm with the following properties. For any
$c \in \mathcal{C}$, any distribution $\mathcal{D}$ over $\mathcal{X}$, any $0 \leq \eta < 1/|\mathcal{Y}|$,
any $0 < \epsilon < 1$, any $0 < \delta < 1$ and any
$\eta \leq \eta_0 < 1/|\mathcal{Y}|$, if the algorithm is given access
to examples drawn from $EX^{\eta}_{SN}(c, \mathcal{D})$ and inputs $\epsilon$, $\delta$ and
$\eta_0$, then with probability at least $1 - \delta$, the
algorithm returns a hypothesis $h \in \mathcal{C}$ with error at most
$\epsilon$. Here, $EX^{\eta}_{SN}(c, \mathcal{D})$ is a structured noise oracle,
which draws examples from $\mathcal{D}$, labels them by $c$ and
randomly replaces with another label with prob. $\eta$.

Note here the rather weak notion of noise: en- 
tire structures are randomly changed, rather than in- 
dividual labels. Furthermore, the error is 0/1 loss 
over the entire structure. Collins (2001) establishes
learnability results for the class of hyperplane
models under 0/1 loss. While not stated directly in terms
of PAC learnability, it is clear that his results apply.
Taskar et al. (2005) establish tighter bounds for the
case of Hamming loss. This suggests that the
requirement of 0/1 loss is weaker.

As suggested before, it is not sufficient for $\chi$ to
simply be correct (the constant 1 function is
correct, but not useful). We need it to be discriminating,
made precise in the following definition.
Definition 4. We say the discrimination of $\chi$ for $h^0$
is $\Pr_{\mathcal{D}}[\chi(f_1(x), h^0(x))]^{-1}$.

In other words, a constraint function is
discriminating when it is unlikely that our weakly-useful
predictor $h^0$ chooses an output that satisfies the
constraint. This means that if we do find examples (in
our unlabeled corpus) that satisfy the constraints,
they are likely to be "useful" to learning.
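
As an illustration of Definition 4, the discrimination can be estimated empirically as the inverse of the fraction of examples on which the initial predictor's output is compatible with the known label for the other task. The function below is a sketch under that reading; its name and the convention of returning infinity when no example is compatible are our own assumptions.

```python
def discrimination(chi, f1_outputs, h0_outputs):
    """Empirical estimate of Pr[chi(f1(x), h0(x))]^-1 over a sample.

    f1_outputs: labels for task 1 on each sampled x
    h0_outputs: predictions of the initial task-2 model on the same x
    """
    if len(f1_outputs) != len(h0_outputs):
        raise ValueError("both sequences must cover the same sample")
    compatible = sum(1 for y1, y2 in zip(f1_outputs, h0_outputs) if chi(y1, y2))
    if compatible == 0:
        return float("inf")          # the constraint is never satisfied
    return len(f1_outputs) / compatible
```

The constant-1 constraint always yields a discrimination of exactly 1, the smallest possible value, which matches the observation that it is correct but not useful.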



Running Example 

We use 3500 examples from NER and 1000 from 
WSJ. We use the remaining 18447 examples as 
unlabeled data. The baseline HMMs achieve F- 
scores of 50.8 and 76.3, respectively. In step 2, we 
add 7512 examples to each data set. After step 3, 
the new models achieve F-scores of 54.6 and 79.2, 
respectively. The gain for NER is lower than be- 
fore as it is trained against "noisy" syntactic labels. 



Running Example 

In the NER HMM, let $h^0$ be the HMM obtained by
training on the small labeled NER data set and let $f_1$
be the true syntactic labels. We approximate $\Pr_{\mathcal{D}}$
by an empirical estimate over the unlabeled
distribution. This gives a discrimination of 41.6 for the
constraint function defined previously. However, if
we compare against "weaker" constraint functions,
we see the appropriate trend. The value for the
constraint based only on POS tags is 39.1 (worse) and
for the NP constraint alone is 27.0 (much worse).



Theorem 1. Suppose $\mathcal{C}_2$ is PAC-learnable with
noise in the structured setting, $h^0_2$ is a weakly useful
predictor of $f_2$, and $\chi$ is correct with respect to
$\mathcal{D}, f_1, f_2, h^0_2$, and has discrimination $\geq 2(|\mathcal{Y}| - 1)$.
Then $\mathcal{C}_2$ is also PAC-learnable with one-sided hints.

The way to interpret this theorem is that it tells
us that if the initial $h_2$ we learn in step 1 of the
one-sided algorithm is "good enough" (in the sense that it
is weakly-useful), then we can use it as specified by
the remainder of the one-sided algorithm to obtain
an arbitrarily good $h_2$ (via iterating).

The dependence on $|\mathcal{Y}|$ in the discrimination
bound for $\chi$ is unpleasant for structured problems. If
we wish to find $M$ unlabeled examples that satisfy
the hints, we'll need at least $2M(|\mathcal{Y}| - 1)$
in total. This dependence can be improved as follows.
Suppose that our structure is represented by a graph
over vertices $V$, each of which can take a label from
a set $Y$. Then, $|\mathcal{Y}| = |Y|^{|V|}$, and our result requires
that $\chi$ be discriminating on an order exponential in
$|V|$. Under the assumption that $\chi$ decomposes over
the graph structure (true for our example) and that
$\mathcal{C}_2$ is PAC-learnable with per-vertex noise, the
discrimination requirement drops to $2|V|(|Y| - 1)$.



Running Example 

In NER, $|Y| = 9$ and $|V| \approx 26$. This means
that the values from the previous example look not
quite so bad. In the 0/1 loss case, they are
compared to $10^{25}$; in the Hamming case, they are
compared to only 416. The ability of the one-sided
algorithm follows the same trends as the
discrimination values. Recall the baseline performance is
50.8. With both constraints (and a discrimination
value of 41.6), we obtain a score of 58.9. With just
the POS constraint (discrimination of 39.1), we
obtain a score of 58.1. With just the NP constraint
(discrimination of 27.0), we obtain a score of 54.5.
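
The two thresholds quoted in this example follow directly from the bounds $2(|Y|^{|V|} - 1)$ for 0/1 loss and $2|V|(|Y| - 1)$ for Hamming loss. A short check of the arithmetic (variable names are ours):

```python
import math

num_labels = 9      # |Y| in the NER running example
num_vertices = 26   # |V|, the approximate sentence length

zero_one_bound = 2 * (num_labels ** num_vertices - 1)   # 2(|Y|^|V| - 1)
hamming_bound = 2 * num_vertices * (num_labels - 1)     # 2|V|(|Y| - 1)

print(f"0/1 loss: about 10^{math.log10(zero_one_bound):.1f}")  # about 10^25.1
print(f"Hamming loss: {hamming_bound}")                        # 416
```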



The final question is how one-sided learning
relates to two-sided learning. The following definition
and easy corollary show that they are related in the
obvious manner, but depend on a notion of
uncorrelation between $h_1$ and $h_2$.

Definition 5. We say that $h_1$ and $h_2$ are
uncorrelated if $\Pr_{\mathcal{D}}[h_1(x) = y_1 \mid h_2(x) = y_2, x] = \Pr_{\mathcal{D}}[h_1(x) = y_1 \mid x]$.

Corollary 1. Suppose $\mathcal{C}_1$ and $\mathcal{C}_2$ are both
PAC-learnable in the structured setting, $h^0_1$ and $h^0_2$ are
weakly useful predictors of $f_1$ and $f_2$, and $\chi$ is
correct with respect to $\mathcal{D}, f_1, f_2, h^0_1$ and $h^0_2$, and
has discrimination $\geq 4(|\mathcal{Y}| - 1)^2$ (for 0/1 loss) or
$\geq 4|V|^2(|Y| - 1)^2$ (for Hamming loss), and that $h^0_1$
and $h^0_2$ are uncorrelated. Then $\mathcal{C}_1$ and $\mathcal{C}_2$ are also
PAC-learnable with two-sided hints.

Unfortunately, Corollary 1 depends quadratically
on the discrimination term, unlike Theorem 1.

4 Experiments 

In this section, we describe our experimental results. 
We have already discussed some of them in the
context of the running example. In Section 4.1, we
briefly describe the data sets we use. A full
description of the HMM implementation and its results is
in Section 4.2. Finally, in Section 4.3, we present
results based on a competitive, discriminatively-learned
sequence labeling algorithm. All results for
NER and chunking are in terms of F-score; all
results for POS tagging are accuracy.

4.1 Data Sets 

Our results are based on syntactic data drawn from
the Penn Treebank (Marcus et al., 1993), specifically
the portion used by the CoNLL 2000 shared task
(Tjong Kim Sang and Buchholz, 2000). Our NER
data is from two sources. The first source is the
CoNLL 2003 shared task data (Tjong Kim Sang and
De Meulder, 2003) and the second source is the 2004
NIST Automatic Content Extraction (Weischedel,
2004). The ACE data constitute six separate data
sets from six domains: weblogs (wl), newswire
(nw), broadcast conversations (bc), United Nations
(un), direct telephone speech (dts) and broadcast
news (bn). Of these, bc, dts and bn are all speech
data sets. All the examples from the previous
sections have been limited to the CoNLL data.



4.2 HMM Results 

The experiments discussed in the preceding sections 
are based on a generative hidden Markov model for 
both the NER and syntactic chunking/POS tagging 
tasks. The HMMs constructed use first-order
transitions and emissions. The emission vocabulary is
pruned so that any word that appears $\leq 1$ time in the
training data is replaced by a unique *unknown*
token. The transition and emission probabilities are
smoothed with Dirichlet smoothing, $\alpha = 0.001$ (this
was not aggressively tuned by hand on one setting).
The HMMs are implemented as finite state models
in the Carmel toolkit (Graehl and Knight, 2002).

The various compatibility functions are also im- 
plemented as finite state models. We implement 
them as a transducer from POS/chunk labels to NER 
labels (though through the reverse operation, they 
can obviously be run in the opposite direction). The 
construction is with a single state with transitions: 

• (NNP,?) maps to B-* and I-*

• (?,B-NP) maps to B-* and O

• (?,I-NP) maps to B-*, I-* and O

• Single exception: (NNP,x), where x is not an NP
tag, maps to anything (this is simply to avoid
empty composition problems). This occurs in
100 of the 212k words in the Treebank data and
more rarely in the automatically tagged data.

4.3 One-sided Discriminative Learning 

In this section, we describe the results of one-sided 
discriminative labeling with hints. We use the true 
syntactic labels from the Penn Treebank to derive 
the constraints (this is roughly 9000 sentences). We
use the LaSO sequence labeling software (Daumé III
and Marcu, 2005), with its built-in feature set.

Our goal is to analyze two things: (1) what is the 
effect of the amount of labeled NER data? (2) what 
is the effect of the amount of labeled syntactic data 
from which the hints are constructed? 

To answer the first question, we keep the
amount of syntactic data fixed (at 8936 sentences)
and vary the amount of NER data in
$N \in \{100, 200, 400, 800, 1600\}$. We compare models
with and without the default gazetteer information
from the LaSO software. We have the following
models for comparison:

• A default "Baseline" in which we simply train 
the NER model without using syntax. 





        Hints     Self-T    Hints
        vs Base   vs Base   vs Self-T
Win     29        20        24
Tie     6         12        11
Lose    0         3         0

Table 1: Comparison between hints, self-training and the
(best) baseline for varying amount of labeled data.

• In "POS-feature", we do the same thing, but we
first label the NER data using a tagger/chunker
trained on the 8936 syntactic sentences. These
labels are used as features for the baseline.

• A "Self-training" setting where we use the
8936 syntactic sentences as "unlabeled," label
them with our model, and then train on the
results. (This is equivalent to a hints model
where $\chi(\cdot, \cdot) = 1$ is the constant 1
function.) We use model confidence as in Blum and
Mitchell (1998).⁴



The results are shown in Figure 1. The trends we
see are the following: 

• More data always helps. 

• Self-training usually helps over the baseline 
(though not always: for instance in wl and parts 

of cts and bn). 

• Adding the gazetteers helps.

• Adding the syntactic features helps. 

• Learning with hints, especially for < 1000 
training data points, helps significantly, even 
over self-training. 

We further compare the algorithms by looking at
how many training settings have each as the winner. In
particular, we compare both hints and self-training
to the two baselines, and then compare hints to
self-training. If results are not significant at the 95%
level (according to McNemar's test), we call it a tie.
The results are in Table 1.
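
For readers unfamiliar with the win/tie/lose bookkeeping, the sketch below shows one common form of McNemar's test. It is an illustration only: the text does not say which variant was used or what the unit of comparison was, so the continuity-corrected chi-square form and the per-item disagreement counts `b` and `c` are assumptions, as are the function names.

```python
import math


def mcnemar_p(b, c):
    """p-value of McNemar's test with continuity correction.

    b: items system A labels correctly and system B labels incorrectly
    c: items system B labels correctly and system A labels incorrectly
    """
    if b + c == 0:
        return 1.0
    stat = max(abs(b - c) - 1, 0) ** 2 / (b + c)
    # survival function of a chi-square variable with 1 degree of freedom
    return math.erfc(math.sqrt(stat / 2))


def outcome(b, c, alpha=0.05):
    """Classify a comparison of A against B as 'win', 'tie' or 'lose'."""
    if mcnemar_p(b, c) >= alpha:
        return "tie"
    return "win" if b > c else "lose"
```

With `alpha=0.05` this corresponds to the 95% level; a comparison at the 90% level would pass `alpha=0.10`.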

In our second set of experiments, we consider the
role of the syntactic data. For this experiment, we
hold the number of NER labeled sentences constant
(at N = 200) and vary the amount of syntactic data
in $M \in \{500, 1000, 2000, 4000, 8936\}$. The results
of these experiments are in Figure 2. The trends are:

• The POS feature is relatively insensitive to the
amount of syntactic data; this is most likely
because its weight is discriminatively adjusted

4 Results without confidence were significantly worse. 
Figure 1: Results of varying the amount of NER labeled data, for a fixed amount (M = 8936) of syntactic data. Series shown: POS-feature; Hints (no gaz); Baseline (no gaz); Hints (w/ gaz); Baseline (w/ gaz); Self-train (no gaz); Self-train (w/ gaz).





        Hints     Self-T    Hints
        vs Base   vs Base   vs Self-T
Win     34        28        15
Tie     1         7         20
Lose    0         0         0

Table 2: Comparison between hints, self-training and the
(best) baseline for varying amount of unlabeled data.



by LaSO so that if the syntactic information is 
bad, it is relatively ignored. 

• Self-training performance often degrades as the 
amount of syntactic data increases. 

• The performance of learning with hints in- 
creases steadily with more syntactic data. 

As before, we compare performance between the 
different models, declaring a "tie" if the difference 
is not statistically significant at the 95% level. The 
results are in Table 2.

In experiments not reported here to save space, 
we experimented with several additional settings. In 
one, we weight the unlabeled data in various ways: 
(1) to make it equal-weight to the labeled data; (2) 
at 10% weight; (3) according to the score produced 
by the first round of labeling. None of these had a
significant impact on scores; in a few cases
performance went up by less than 1, in a few cases performance
went down about the same amount.

4.4 Two-sided Discriminative Learning 

In this section, we explore the use of two-sided 
discriminative learning to boost the performance of 
our syntactic chunking, part of speech tagging, and 
named-entity recognition software. We continue to
use LaSO (Daumé III and Marcu, 2005) as the
sequence labeling technique.

The results we present are based on attempting to
improve the performance of a state-of-the-art system
trained on all of the training data. (This is in contrast to
the results in Section 4.3, in which the effect of
using limited amounts of data was explored.) For the
POS tagging and syntactic chunking, we begin with
all 8936 sentences of training data from CoNLL. For
the named entity recognition, we limit our
presentation to results from the CoNLL 2003 NER shared
task. For this data, we have roughly 14k sentences
of training data, all of which are used. In both cases,
we reserve 10% as development data. The
development data is used to do early stopping in LaSO.

As unlabeled data, we use 1M sentences extracted
from the North American National Corpus of
English (previously used for self-training of parsers
(McClosky et al., 2006)). These 1M sentences were
selected by dev-set relativization against the union
of the two development data sets.

Figure 2: Results of varying amount of syntactic data for a fixed amount of NER data (N = 200 sentences).

Following similar ideas to those presented by
Blum and Mitchell (1998), we employ two slight
modifications to the algorithm presented in
Section 3.2. First, in step (2b), instead of adding all
allowable instances to the labeled data set, we only
add the top R (for some hyper-parameter R), where
"top" is determined by average model confidence for
the two tasks. Second, instead of using the full
unlabeled set to label at each iteration, we begin with
a random subset of 10R unlabeled examples and
add another random 10R every iteration.
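
The two modifications can be sketched as a pair of helper functions. This is a hedged illustration rather than the actual implementation: it assumes each hypothesis returns a `(label, confidence)` pair, and the names `grow_pool` and `select_top_r` are invented for the example.

```python
import random


def grow_pool(pool, remaining, R, rng):
    """Move 10*R randomly chosen examples from `remaining` into `pool`."""
    k = min(10 * R, len(remaining))
    picked = set(rng.sample(range(len(remaining)), k))
    pool.extend(x for i, x in enumerate(remaining) if i in picked)
    remaining[:] = [x for i, x in enumerate(remaining) if i not in picked]


def select_top_r(pool, h1, h2, chi, R):
    """Return the R compatible examples with highest average confidence.

    h1, h2: functions x -> (label, confidence) for the two tasks
    chi:    compatibility function on the two predicted labels
    """
    scored = []
    for x in pool:
        (y1, c1), (y2, c2) = h1(x), h2(x)
        if chi(y1, y2):
            scored.append(((c1 + c2) / 2, x, y1, y2))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(x, y1, y2) for _, x, y1, y2 in scored[:R]]
```

In each iteration one would call `grow_pool` once, then add the triples returned by `select_top_r` to the two labeled sets before retraining.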

We use the same baseline systems as in one-sided 
learning: a Baseline that learns the two tasks inde- 
pendently; a variant of the Baseline on which the 
output of the POS/chunker is used as a feature for 
the NER; a variant based on self-training; the hints- 
based method. In all cases, we do use gazetteers. We 
run the hints-based model for 10 iterations. For self- 
training, we use 10 R unlabeled examples (so that it 
had access to the same amount of unlabeled data as 
the hints-based learning after all 10 iterations). We 
used three values of R: 50, 100, 500. We select the 





              Chunking   NER
Baseline      94.2       87.5
w/POS                    88.0
Self-train
  R = 50      94.2       88.0
  R = 100                88.6
  R = 500     94.1       88.4
Hints
  R = 50      94.2       88.5
  R = 100     94.3       89.1
  R = 500     94.3       89.0

Table 3: Results on two-sided learning with hints.

best-performing model (by the dev data) over these 
ten iterations. The results are in Table 3.

As we can see, performance for syntactic chunk- 
ing is relatively stagnant: there are no significant 
improvements for any of the methods over the base- 
line. This is not surprising: the form of the con- 
straint function we use tells us a lot about the NER 
task, but relatively little about the syntactic chunking 
task. In particular, it tells us nothing about phrases 
other than NPs. On the other hand, for NER, we see 
that both self-training and learning with hints im- 
prove over the baseline. The improvements are not 



enormous, but are significant (at the 95% level, as 
measured by McNemar's test). Unfortunately, the 
improvements for learning with hints over the self- 
training model are only significant at the 90% level. 

5 Discussion 

We have presented a method for simultaneously 
learning two tasks using prior knowledge about the 
relationship between their outputs. This is related
to joint inference (Daumé III et al., 2006). However,
we do not require that a single data set
be labeled for multiple tasks. In all our examples,
we use separate data sets for shallow parsing and for
named-entity recognition. Although all our
experiments used the LaSO framework for sequence
labeling, there is nothing in our method that assumes
any particular learner; alternatives include
conditional random fields (Lafferty et al., 2001),
independent predictors (Punyakanok and Roth, 2001),
max-margin Markov networks (Taskar et al., 2005), etc.
Our approach, both algorithmically and
theoretically, is most related to ideas in co-training (Blum
and Mitchell, 1998). The key difference is that in
co-training, one assumes that the two "views" are
on the inputs; here, we can think of the two
output spaces as being the different "views" and the
compatibility function $\chi$ being a method for
reconciling these two views. Like the pioneering work
of Blum and Mitchell, the algorithm we employ in
practice makes use of incrementally augmenting the
unlabeled data and using model confidence. Also
like that work, we do not currently have a
theoretical framework for this (more complex) model.⁵ It
would also be interesting to explore soft hints, where
the range of $\chi$ is $[0, 1]$ rather than $\{0, 1\}$.

Recently, Ganchev et al. (2008) proposed a co- 
regularization framework for learning across multi- 
ple related tasks with different output spaces. Their 
approach hinges on a constrained EM framework 
and addresses a quite similar problem to that
addressed by this paper. Chang et al. (2008) also
propose a "semisupervised" learning approach quite
similar to our own model. They show very
promising results in the context of semantic role labeling.



5 Dasgupta et al. (2001) proved, three years later, that a
formal model roughly equivalent to the actual Blum and Mitchell
algorithm does have solid theoretical foundations. 



Given the apparent (very!) recent interest in this 
problem, it would be ideal to directly compare the 
different approaches. 

In addition to an analysis of the theoretical prop- 
erties of the algorithm presented, the most com- 
pelling avenue for future work is to apply this frame- 
work to other task pairs. With a little thought, one 
can imagine formulating compatibility functions be- 
tween tasks like discourse parsing and summariza- 



tion (Marcu, 2000), parsing and word alignment, or 
summarization and information extraction. 
A Proofs

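The proof of Theorem 1 closely follows that of Blum
and Mitchell (1998).

Proof (Theorem 1, sketch). Use the following
notation: $c_k = \Pr_{\mathcal{D}}[h(x) = k]$, $p_l = \Pr_{\mathcal{D}}[f(x) = l]$,
$q_{l|k} = \Pr_{\mathcal{D}}[f(x) = l \mid h(x) = k]$. Denote by $\mathcal{A}$
the set of outputs that satisfy the constraints. We are
interested in the probability that $h(x)$ is erroneous,
given that $h(x)$ satisfies the constraints:

$$
\begin{aligned}
P\big(h(x) \in \mathcal{A} \setminus \{l\} \mid f(x) = l\big)
&= \sum_{k \in \mathcal{A} \setminus \{l\}} P\big(h(x) = k \mid f(x) = l\big)
 = \sum_{k \in \mathcal{A} \setminus \{l\}} c_k \, q_{l|k} / p_l \\
&\leq \sum_{k \in \mathcal{A}} c_k \Big( |\mathcal{Y}| - 1 + \epsilon \sum_{l \neq k} \frac{1}{p_l} \Big)
 \leq 2 \sum_{k \in \mathcal{A}} c_k \big( |\mathcal{Y}| - 1 \big)
\end{aligned}
$$

Here, the second step is Bayes' rule plus definitions,
the third step is by the weak initial hypothesis
assumption, and the last step is by algebra. Thus, in
order to get a probability of error at most $\eta$, we need
$\sum_{k \in \mathcal{A}} c_k = \Pr[h(x) \in \mathcal{A}] \leq \eta / (2(|\mathcal{Y}| - 1))$. □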
The proof of Corollary 1 is straightforward.

Proof (Corollary 1, sketch). Write out the
probability of error as a double sum over true labels $y_1, y_2$
and predicted labels $\hat{y}_1, \hat{y}_2$ subject to $\chi(\hat{y}_1, \hat{y}_2)$. Use
the uncorrelation assumption and Bayes' rule to split this
into the product of two terms as in the proof of
Theorem 1. Bound as before. □